
This report explores a dataset containing attributes of approximately 4,900 white wine samples (4,898 observations).

Univariate Plots Section

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ S.No                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##       S.No      fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
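The three outputs above look like the results of `dim()`, `str()` and `summary()` on the data frame. The sketch below shows those calls. The file name is hypothetical, because the report does not say where the data was loaded from, and the stand-in data frame holds only the first five rows of a few columns, copied from the `str()` output.

```r
# Hypothetical file name: the report does not say where the data was read from.
# df <- read.csv("wineQualityWhites.csv")

# Stand-in: first five rows of a few columns, copied from the str() output above.
df <- data.frame(
  S.No           = 1:5,
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

print(dim(df))      # number of rows and columns
str(df)             # type and first values of each column
print(summary(df))  # min, quartiles, median, mean and max per column
```

On the full dataset the same calls would report 4898 rows and 13 columns, as shown above.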

The dataset contains 4,898 observations of 13 variables: a row index (S.No), 11 physicochemical attributes and a quality rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Most of the white wine observations have a quality rating between 5 and 7. The median is 6 and the mean is 5.878.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Although the count of observations peaks at an alcohol content of about 9.5%, most of the observations have an alcohol content between 9% and 11%. The median and mean alcohol content are 10.40 and 10.51 respectively.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH value is approximately normally distributed, and most of the observations lie between 2.9 and 3.4 on the pH scale. The median is 3.180 and the mean is 3.188.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density is approximately normally distributed with very few outliers. Most of the wine observations lie between 0.991 and 0.996. The median is 0.9937 and the mean is 0.9940.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The first plot shows that fixed acidity is approximately normally distributed with a few outliers. To better show the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a fixed acidity of around 6.8, and half of the samples lie between 6.3 and 7.3 (the first and third quartiles).
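A boxplot flags outliers using fences around the quartiles. The sketch below assumes the common 1.5 x IQR rule, which is the default in both base R's `boxplot()` and ggplot2's `geom_boxplot()`. The report does not say which plotting function was used, and `tukey_fences` is a hypothetical helper name.

```r
# Hypothetical helper: Tukey fences computed from the first and third quartiles.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles of fixed acidity as printed in the summary above.
fences <- tukey_fences(q1 = 6.3, q3 = 7.3)
print(fences)  # lower fence near 4.8, upper fence near 8.8

# The printed minimum (3.8) and maximum (14.2) both fall outside the fences,
# so a boxplot would flag points at both ends of the distribution.
min_outside <- 3.8 < fences[["lower"]]
max_outside <- 14.2 > fences[["upper"]]
```

The same helper can be applied to any other attribute by passing in its printed quartiles.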

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

The first plot shows that volatile acidity is approximately normally distributed with a few outliers. To better show the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a volatile acidity of around 0.28, and 75% of the samples have a volatile acidity of 0.32 or below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid content is approximately normally distributed with outliers. To better show the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a citric acid level of around 0.34, and 75% of the samples have a citric acid level of 0.39 or below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The distribution of residual sugar is right skewed: more than 75% of white wine samples have a residual sugar content below 10, and there are a few outliers (the maximum is 65.8). To better show the distribution, a boxplot was also drawn. The highest count of wine samples is at a residual sugar content of around 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## [1] 1163

White wine samples have a very low chlorides content: 75% of samples are at or below 0.05, and 1163 samples (about 24%) are above 0.05. The range is wide, with a minimum of 0.009 and a maximum of 0.346. Because of the large number of outliers, the distribution of chlorides content is skewed far to the right. To better show the distribution, the long-tailed data is transformed to a log10 scale. The transformed distribution looks approximately normal, with the highest count of samples at a chlorides content of around 0.045.
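The threshold count and the log10 transform described above can be sketched as follows. The vector holds only the first ten chlorides values from the `str()` output, so the numbers it produces are illustrative rather than the report's results. The commented ggplot2 call is an assumption about how the plot might have been drawn.

```r
# First ten chlorides values from the str() output; the full column has 4898 values.
chlorides <- c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045, 0.045, 0.049, 0.044)

# Count of samples strictly above the 0.05 threshold (the report gives 1163 on the full data).
n_above <- sum(chlorides > 0.05)
print(n_above)

# log10 compresses the long right tail so the bulk of the data is easier to see.
log_chlorides <- log10(chlorides)
print(summary(log_chlorides))

# A value on the log scale is converted back to the original units with 10^x.
median_back <- 10^median(log_chlorides)
print(median_back)

# Assumed plotting call (requires ggplot2 and the full data frame df):
# ggplot(df, aes(x = chlorides)) + geom_histogram() + scale_x_log10()
```

Using `scale_x_log10()` keeps the axis labelled in the original units, which is why a peak can still be read off as roughly 0.045.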

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The distribution of free sulfur dioxide is more or less normal with a few outliers. To better show the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a free sulfur dioxide content of around 35, and 75% of the wine samples are at or below 46.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The distribution of total sulfur dioxide is more or less normal with a few outliers. To better show the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a total sulfur dioxide content of around 138, and 75% of the wine samples are at or below 167.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

75% of white wine samples have a sulphates level of 0.55 or below, and the mean is 0.4898.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wine samples with 12 attributes (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). The main observations from the dataset are:
1) Most of the white wine observations have a quality score of 6.
2) Around 75% of the observations have a residual sugar content below 10, while the minimum and maximum values are 0.6 and 65.8 respectively.
3) The median and mean alcohol content are 10.40 and 10.51 respectively.
4) pH is approximately normally distributed, with a mean of 3.188.
5) There are 4 outliers with fixed acidity greater than 10.5 and 6 outliers with volatile acidity greater than 0.9.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the dataset are quality, alcohol and residual sugar content. I believe pH and residual sugar, in combination with other attributes, can be used to build a predictive model for white wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Citric acid content, density and chlorides content may also contribute to the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I have not created any new variables from the existing ones, as each one measures a distinct property.

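Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions of residual sugar and chlorides are right skewed, with long tails of high values. I plotted chlorides on a log10 scale to make its distribution easier to read. Apart from that, I did not perform any operations to tidy, adjust or change the form of the data.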

Bivariate Plots Section

## df$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## df$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## df$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## df$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## df$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## df$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## df$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

The above plots show that high quality wine samples have a lower chlorides content.
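The grouped summaries above have the layout that `by(df$chlorides, df$quality, summary)` prints, though the report does not show the call itself. As a rough check on the trend, the sketch below copies the per-quality median and mean from those summaries and looks at the change from one rating to the next. The data frame `chl` and its column names are introduced here for illustration.

```r
# Per-quality chlorides statistics copied from the grouped summaries above.
chl <- data.frame(
  quality = 3:9,
  median  = c(0.041, 0.046, 0.047, 0.043, 0.037, 0.036, 0.031),
  mean    = c(0.05430, 0.0501, 0.05155, 0.04522, 0.03819, 0.03831, 0.0274)
)

# Change from the previous rating; a negative value means chlorides fell.
chl$median_change <- c(NA, diff(chl$median))
chl$mean_change   <- c(NA, diff(chl$mean))
print(chl)

# From quality 5 upward: does each statistic fall at every step?
from5 <- chl$quality >= 5
median_decreasing <- all(diff(chl$median[from5]) < 0)
mean_decreasing   <- all(diff(chl$mean[from5]) < 0)
print(c(median = median_decreasing, mean = mean_decreasing))
```

The median falls at every step from quality 5 upward, while the mean has one very small rise between ratings 7 and 8 (0.03819 to 0.03831).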

White wine samples with high fixed acidity, volatile acidity, citric acid and residual sugar content have a lower chlorides content. This might have an impact on the quality of the wine.

The above plot supports the assumption that quality decreases as fixed acidity, volatile acidity, citric acid and residual sugar content increase.

Density is related to residual sugar and alcohol. As the above plot shows, density increases as residual sugar increases and decreases as alcohol content increases.
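The direction of these two relationships can be checked with a correlation coefficient. The sketch below uses only the first five rows shown in the `str()` output, where density is rounded to three decimals, so it illustrates the calculation and the signs rather than the strength of the relationship in the full data.

```r
# First five rows of three columns, copied from the str() output near the top of the report.
first_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation of density with each of the other two attributes.
r_sugar   <- cor(first_rows$density, first_rows$residual.sugar)
r_alcohol <- cor(first_rows$density, first_rows$alcohol)
print(c(sugar = r_sugar, alcohol = r_alcohol))

# On the full data frame the same calls would be, for example:
# cor(df$density, df$residual.sugar)
# cor(df$density, df$alcohol)
```

Even in this tiny sample the signs match the plot: positive for residual sugar and negative for alcohol.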

As expected, the pH value decreases as fixed acidity increases.

## df$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.587   4.600   6.393  10.700  16.200 
## -------------------------------------------------------- 
## df$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.300   2.500   4.628   7.100  17.550 
## -------------------------------------------------------- 
## df$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## -------------------------------------------------------- 
## df$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## -------------------------------------------------------- 
## df$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## -------------------------------------------------------- 
## df$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.100   4.300   5.671   8.200  14.800 
## -------------------------------------------------------- 
## df$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    2.00    2.20    4.12    4.20   10.60

There is a relation between quality and residual sugar. From a quality rating of 5 upward, mean residual sugar generally falls as quality rises, from 7.335 at quality 5 to 4.12 at quality 9, with a small rise at quality 8.

## df$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## -------------------------------------------------------- 
## df$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## -------------------------------------------------------- 
## df$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## df$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## df$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## -------------------------------------------------------- 
## df$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## -------------------------------------------------------- 
## df$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410

This plot shows that the pH value varies within each quality rating, with the widest range at a quality rating of 6 (2.72 to 3.81). From a quality rating of 5 upward, the mean pH value increases with quality, from 3.169 to 3.308.

## df$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## df$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## df$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## df$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## df$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## df$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## df$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

The above plots clearly show that quality increases with alcohol content. The summary shows that from a quality rating of 5 upward, the mean alcohol content rises steadily with quality, from 9.809 at quality 5 to 12.18 at quality 9.
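One way to put a number on this trend is to work from the per-quality means printed above. The sketch below computes the gain in mean alcohol between ratings 5 and 9 and a Spearman rank correlation between rating and mean. Note that this correlates seven group means, not the 4898 individual samples, so it overstates how well alcohol predicts the quality of a single wine. The data frame `alc` is introduced here for illustration.

```r
# Per-quality mean alcohol content copied from the grouped summaries above.
alc <- data.frame(
  quality = 3:9,
  mean    = c(10.35, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)
)

# Does the mean rise at every step from quality 5 upward?
from5 <- alc$quality >= 5
rising_from_5 <- all(diff(alc$mean[from5]) > 0)

# Gain in mean alcohol content between quality 5 and quality 9.
gain_5_to_9 <- alc$mean[alc$quality == 9] - alc$mean[alc$quality == 5]

# Rank correlation between rating and group mean across all seven ratings.
rho <- cor(alc$quality, alc$mean, method = "spearman")

print(c(rising = rising_from_5, gain = gain_5_to_9, rho = rho))
```

The rank correlation is below 1 because the mean falls between ratings 3 and 5 before it starts to rise.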

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a strong relation between alcohol content and quality. In line with my intuition, there is also a strong correlation between residual sugar and quality. Chlorides content decreases as fixed acidity, volatile acidity, citric acid and residual sugar increase, so these attributes could also have an impact on the quality of the wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Density had a strong relationship with residual sugar and alcohol content. The pH value had a relationship with fixed acidity.

What was the strongest relationship you found?

Alcohol content has a strong relationship with quality. Density also has a strong relationship with residual sugar and alcohol content. As expected, the pH value also has a relationship with fixed acidity.

Multivariate Plots Section

Most of the high quality wines have a low chlorides content, which can be seen in the above plot.

Residual sugar content is lower in high quality wines. An increase in residual sugar content leads to an increase in density.

The above plot shows that alcohol content is high and chlorides content is low in high quality wine samples.

The above plot shows that chlorides content falls as fixed acidity, volatile acidity, citric acid and residual sugar content increase. High quality white wine samples also have a lower chlorides content.

The above plot shows that density increases with residual sugar content and decreases with alcohol content. Most of the high quality wine samples have low residual sugar content, high alcohol content and low density.

Here again, the above plot shows that most of the high quality wine samples have low residual sugar content and high alcohol content.

The above plot shows that most of the high quality wine samples have low chlorides content and high alcohol content.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Alcohol content, residual sugar and chlorides content play an important role in determining the quality of wine, which can be seen from the above plots.

Were there any interesting or surprising interactions between features?

One interesting interaction is between density and both residual sugar and alcohol content.

Final Plots and Summary

Plot One

## df$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## df$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## df$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## df$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## df$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## df$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## df$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

Description One

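The above plot shows that high quality wine samples have a lower chlorides content. From a quality rating of 5 upward, the median chlorides content falls at every step, from 0.047 at quality 5 to 0.031 at quality 9. The mean falls from 0.05155 to 0.0274 over the same ratings, apart from a negligible rise between ratings 7 and 8.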

Plot Two

Description Two

The above plot shows that density increases with residual sugar content and decreases with alcohol content. Most of the high quality wine samples have low residual sugar content and high alcohol content.

Plot Three

Description Three

The above plot shows that most of the high quality white wine samples have low chlorides content and high alcohol content.

Reflection

The white wine data set contains information on almost 4,900 white wine samples across 12 attributes. I started by understanding the individual attributes in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of white wine samples across many attributes and created plots to help predict the quality of white wine.

The exploration of the white wine dataset shows that the quality of white wine is largely based on the alcohol, residual sugar and chlorides content. High quality white wine samples have high alcohol content and low residual sugar and chlorides content.

In future, I would like to explore the acidity features of the white wine and how they affect its quality.